PROCEEDINGS OF SCIENCE



LOFAR and HDF5: Toward a New Radio Data 
Standard 






Kenneth R. Anderson*

University of Amsterdam (UvA), Amsterdam, The Netherlands 



E-mail: k.r.anderson@uva.nl



Anastasia Alexov 

University of Amsterdam (UvA), Amsterdam, The Netherlands 



E-mail: a.alexov@uva.nl 



Lars Bähren

Radboud University of Nijmegen, Nijmegen, The Netherlands 
E-mail: lbaehren@gmail.com



Jean-Mathias Grießmeier

Netherlands Institute for Radio Astronomy (ASTRON), Dwingeloo, The Netherlands 
E-mail: griessmeier@astron.nl



Michael Wise 

Netherlands Institute for Radio Astronomy (ASTRON), Dwingeloo, The Netherlands 

Gerard Adriaan Renting 

Netherlands Institute for Radio Astronomy (ASTRON), Dwingeloo, The Netherlands 

For decades now, scientific data volumes have experienced relentless, exponential growth. As a 
result, legacy astronomical data formats are straining under a burden not conceived when these 
formats were first introduced. With future astronomical projects ensuring this trend, ASTRON 
and the LOFAR project are exploring the use of the Hierarchical Data Format, version 5 (HDF5), 
for LOFAR radio data encapsulation. Most of LOFAR's standard data products will be stored 
using the HDF5 format. In addition, HDF5 analogues for traditional radio data structures such as 
visibility data and spectral image cubes are also being developed. The HDF5 libraries allow for 
the construction of distributed, entirely unbounded files. The nature of the HDF5 format further 
provides the ability to custom design a data encapsulation format, specifying hierarchies, content 
and attributes. The LOFAR project has designed several data formats that will accommodate and 
house all LOFAR data products, the primary styles and kinds of which are presented in this paper. 
With proper development and support, it is hoped that these data formats will be adopted by other 
astronomical projects as they, too, attempt to grapple with a future filled with mountains of data. 

1. Introduction

The promising advent of the LOFAR telescope's operational epoch holds forth both great
scientific potential and challenges to current and legacy information technologies: the volume and
complexity of the data will continue to push the envelope of commonly used data protocols.

Recognizing that this envelope is already strained, the LOFAR project has embarked on an 
ambitious project to design and define a set of radio data standard formats that are capable of 
encapsulating the full spectrum of, not just LOFAR data products, but astronomical radio data in 
general. 

It is with this ambition in mind that the LOFAR data formats group has been developing
these format specifications and associated software infrastructure, an effort now ongoing for over
two years. It was determined that HDF5 would be a robust, viable data framework that can handle
the size, scope, diversity, distributed nature and parallel I/O processing requirements of LOFAR
data. This work also has potential use beyond the radio community. New large-scale optical
telescopes such as the LSST are also investigating the viability of using HDF5. Furthermore, the
20-year history of HDF and its continuing use by NASA's Earth-observing missions ensure
broad, ongoing use and support.

In addition to the format descriptions themselves, the LOFAR project is currently developing a 
set of software tools for creating and working with these formats. The Data Access Library (DAL) 
in C++, along with an associated Python interface (pyDAL), are designed to allow for the easy 
construction and manipulation of these data formats. There are also a number of tools already
available to read and visualize HDF5 files, such as HDFView, VisIt, PyTables, h5py and IDL.

2. Data Files in the Modern Age

With the invention and subsequent commercialization of the charge-coupled device (CCD),
data volumes in the field of astronomy have grown exponentially since the early 1970s (Borne,
2009) and, as Table 1 conveys, have now attained volumes that early file and file system protocols
find difficult to handle. For future radio astronomy projects, this plight is especially acute.



Epoch    Nominal file data volume
1970     2^10 bytes
1980     2^20 bytes
1990     2^30 bytes
2000     2^40 bytes

Table 1: Typical file sizes of scientific datasets have increased exponentially since the 1970s.
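
The growth rate implied by Table 1 can be made explicit with a few lines of arithmetic. The sketch below takes the nominal sizes as read from the table (expressed as powers of two) and derives the growth factor per decade and the corresponding doubling time. The function and variable names are illustrative only.

```python
# Nominal file sizes from Table 1 as (year, exponent), where size = 2**exponent bytes.
TABLE_1 = [(1970, 10), (1980, 20), (1990, 30), (2000, 40)]


def growth_factor_per_decade(table):
    """Return the multiplicative growth in file size between consecutive decades."""
    factors = []
    for (_, e0), (_, e1) in zip(table, table[1:]):
        factors.append(2 ** (e1 - e0))
    return factors


def doubling_time_years(table):
    """Return the average time, in years, for the nominal file size to double."""
    (first_year, first_exp), (last_year, last_exp) = table[0], table[-1]
    doublings = last_exp - first_exp  # each unit of the exponent is one doubling
    return (last_year - first_year) / doublings


factors = growth_factor_per_decade(TABLE_1)
doubling = doubling_time_years(TABLE_1)
print("growth per decade:", factors)
print("doubling time (years):", doubling)
```

In other words, the nominal sizes in Table 1 correspond to roughly a thousandfold increase per decade, or one doubling per year.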

Indeed, the path of diminishing utility of legacy protocols is clearly delineated. This looming
predicament is especially germane to the SKA pathfinder LOFAR project, wherein certain
operational modes will be capable of generating datasets comprising hundreds of gigabytes to tens of
terabytes. Therefore, the LOFAR project has been driven to consider viable alternatives to "standard"
astronomical data formats, such as FITS. Though FITS development has sought to add capability
to the protocol as needed (ESO-FWG, 2008), such ex post facto approaches should not be expected
to remain viable in the future.



Exposure   Number of   Number of   File Size     File Size
Time       Subbands    Stations    Known Mode    Search Mode
1 min      248         5           11.2GB        56GB
1 min      248         20          11.2GB        244GB
10 min     248         5           112GB         560GB
10 min     248         10          112GB         1.1TB
10 min     248         20          112GB         2.2TB
10 min     248         30          112GB         3.3TB
20 min     248         5           224GB         1.1TB
30 min     248         5           336GB         1.7TB
1 hr       248         5           672GB         3.4TB
1 hr       248         10          672GB         6.7TB
1 hr       248         20          672GB         13.4TB
1 hr       248         30          672GB         26.8TB
2 hr       248         5           1.3TB         6.7TB
12 hr      248         5           8.0TB         40.3TB
12 hr      248         15          24.0TB        120.1TB

Table 2: Sample LOFAR beam-formed dataset sizes



Legacy file technologies will eventually fail under the growing demands of newer and larger
scientific datasets. If the observational modes of SKA pathfinder projects like LOFAR suggest
anything, it is that legacy technologies will have to be abandoned, not just because of the
volumes of data involved, but also because of the complex nature of the data per se.

By way of a LOFAR example, Table 2 presents expected data volumes for several configurations
of the Beam-Formed observational mode. A once useful unit, the megabyte has been left in the wake
of next generation science.
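
Most entries of Table 2 can be approximated by simple arithmetic. The sketch below assumes, as the first row suggests, a known-mode rate of 11.2 GB per minute for 248 subbands. It further assumes that the search-mode volume scales linearly with the multiplier in the third column of the table (read here as the number of stations). Both scaling rules, the decimal unit convention (1 TB = 1000 GB) and all function names are illustrative assumptions rather than part of any LOFAR specification, and not every tabulated entry follows them exactly.

```python
# Assumed rate, taken from the first row of Table 2: 11.2 GB per minute of
# exposure in known mode with 248 subbands.
KNOWN_MODE_GB_PER_MIN = 11.2


def known_mode_gb(minutes):
    """Estimated known-mode file size in GB for a given exposure time."""
    return KNOWN_MODE_GB_PER_MIN * minutes


def search_mode_gb(minutes, stations):
    """Estimated search-mode file size in GB, assuming linear scaling with stations."""
    return known_mode_gb(minutes) * stations


def format_size(gb):
    """Format a size in GB the way Table 2 does, assuming 1 TB = 1000 GB."""
    if gb >= 1000:
        return f"{gb / 1000:.1f}TB"
    return f"{gb:g}GB"


# A few (exposure in minutes, stations) configurations from Table 2.
for minutes, stations in [(1, 5), (10, 5), (60, 20), (720, 5)]:
    known = format_size(known_mode_gb(minutes))
    search = format_size(search_mode_gb(minutes, stations))
    print(f"{minutes:4d} min, {stations:2d} stations: known {known}, search {search}")
```

Under these assumptions, the implied sustained rate in known mode is a little under 190 MB per second, which gives a sense of why file sizes reach the terabyte range within hours.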

3. LOFAR Data Formats in HDF5 

3.1 LOFAR data and the use of HDF5 (Hierarchical Data Format 5)

Datasets produced by LOFAR observations will vary tremendously in size. Images, Beam-formed
data, and Transient Buffer Board (TBB) time-series data are expected to produce large files,
with the beam-formed and TBB data potentially forming files of several tens of terabytes. This,
combined with the complex nature of the data from certain modes of observation, led the LOFAR
project to examine and consider the Hierarchical Data Format (version 5) as a robust and viable
solution to the volume-complexity problem:

- HDF5 presents a robust data model, featuring distributed files.

- HDF5 data model: "a file system for your data," which operates through POSIX-style file
navigation, e.g., path names to objects:

OBSERVATION/Image001/Data

- Versatile data model accommodates complex data objects and associated metadata.

- User-defined file design.

- Portable, unbounded file format.

- Multi-platform software library for single to massively parallel systems.
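
Two of the properties listed above, path-style navigation and unbounded datasets, can be illustrated with h5py, one of the packages named in this paper. The following is a minimal sketch, not a LOFAR file definition. The file name, the array shape and the chunk size are arbitrary choices made for illustration, and the file is held in memory so that nothing is written to disk.

```python
import h5py
import numpy as np

# In-memory HDF5 file; "lofar_sketch.h5" is only a placeholder name.
with h5py.File("lofar_sketch.h5", "w", driver="core", backing_store=False) as f:
    # Intermediate groups are created automatically from the path name.
    image = f.create_group("OBSERVATION/Image001")

    # A chunked dataset whose first axis has no fixed upper bound.
    data = image.create_dataset(
        "Data",
        shape=(0, 4),
        maxshape=(None, 4),
        chunks=(1024, 4),
        dtype="f4",
    )

    # Append two blocks of rows by resizing, then writing into the new region.
    for block in (np.ones((3, 4), dtype="f4"), np.zeros((2, 4), dtype="f4")):
        start = data.shape[0]
        data.resize(start + block.shape[0], axis=0)
        data[start:] = block

    # Objects are addressed by POSIX-style path names, as in a file system.
    same_data = f["/OBSERVATION/Image001/Data"]
    final_shape = same_data.shape
    group_members = list(f["OBSERVATION"].keys())
    first_row_sum = float(same_data[0].sum())

print(final_shape, group_members, first_row_sum)
```

Because the dataset is chunked and its first axis is declared unlimited, rows can be appended as an observation proceeds without knowing the final size in advance.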

The following schematic diagrams present two LOFAR file designs using the HDF5 framework.
Readers should bear in mind that all LOFAR data products have been designed to be as
structurally parallel to one another as possible.

Figure 1: LOFAR Beam Formed High Level Data Structure; tables/arrays are not shown, but are implied at
the Stokes level. (Alexov et al., 2010)

Figure 2: LOFAR Radio Sky Image Cubes, High Level Data Structure; tables/arrays are not shown, but are
implied at the Data group level. (Anderson et al., 2010)



4. LOFAR Data Format Specifications 

HDF5 provides a framework allowing users to essentially design their own files to
appropriately encapsulate and characterize a variety of data.
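
As an illustration of what such a user-defined design looks like in practice, the sketch below builds a small hierarchy with metadata attached as attributes and then walks the file to list its contents. It uses h5py. All group, dataset and attribute names here are hypothetical and are not taken from the LOFAR Interface Control Documents.

```python
import h5py
import numpy as np


def build_sketch(f):
    """Populate an open h5py.File with a small, hypothetical layout."""
    f.attrs["TELESCOPE"] = "LOFAR"  # hypothetical root attribute
    for i in range(2):
        beam = f.create_group(f"OBSERVATION/Beam{i:03d}")  # hypothetical names
        beam.attrs["BEAM_INDEX"] = i
        stokes = beam.create_dataset("StokesI", data=np.zeros((8, 16), dtype="f4"))
        stokes.attrs["UNITS"] = "arbitrary"


def describe(f):
    """Return (path, kind, attribute names) for every object below the root."""
    rows = []

    def visit(name, obj):
        kind = "dataset" if isinstance(obj, h5py.Dataset) else "group"
        rows.append((name, kind, sorted(obj.attrs.keys())))

    f.visititems(visit)
    return rows


# In-memory file; "design_sketch.h5" is only a placeholder name.
with h5py.File("design_sketch.h5", "w", driver="core", backing_store=False) as f:
    build_sketch(f)
    layout = describe(f)
    telescope = f.attrs["TELESCOPE"]

for path, kind, attr_names in layout:
    print(f"{path:32s} {kind:8s} {attr_names}")
```

The point of the sketch is that hierarchy, content and attributes are all chosen by the file designer, which is what allows one structural pattern to be reused across different data products.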

For over two years now, the LOFAR project has been engaged in developing and designing
a complete set of specifications for all LOFAR observational data. This has necessarily required
differing file designs for differing data, with a certain structural parallelism maintained across all
file designs. These Interface Control Documents (ICD) (shown with their document identifiers)
provide detailed descriptions of all expected LOFAR Data Products.[1]



LOFAR Data File Type      LOFAR Document ID
TBB Time Series Data      LOFAR-USG-ICD-001
Beam-Formed Data          LOFAR-USG-ICD-003
Radio Sky Image Cubes     LOFAR-USG-ICD-004
Dynamic Spectrum Data     LOFAR-USG-ICD-006
Visibility Data           LOFAR-USG-ICD-007
RM Synthesis Cubes        LOFAR-USG-ICD-008

Table 3: LOFAR file types and the corresponding Interface Control Documents

All LOFAR ICD documents are publicly available and can be downloaded from
http://usg.lofar.org/wiki/doku.php?id=documents:lofar_data_products

5. Toolsets, Libraries, Packages 

Since the inception of HDF in 1988, the body of libraries and tools available to work with HDF
files has grown substantially. Continued use, maintenance and development are assured by the
adoption of HDF by Boeing, by NASA for the Earth Observing System (EOS), and by the National
Oceanic and Atmospheric Administration (NOAA) for the National Polar-orbiting Operational
Environmental Satellite System (NPOESS).

In addition to the HDF5 library itself, these packages and interfaces are also available:



Package     Location
IDL         http://www.ittvis.com/
VisIt       https://wci.llnl.gov/
HDFView     http://www.hdfgroup.org/
DAL         http://usg.lofar.org/
pyDAL       http://usg.lofar.org/
h5py        http://h5py.alfven.org/
PyTables    http://www.pytables.org/

Table 4: Some available HDF5 packages



The LOFAR project is developing the C++ Data Access Library (DAL) and an associated
Python wrapper, pyDAL, which will provide full scope constructors for creating and accessing
LOFAR data products. Further tools and packages are catalogued at the HDF Group website,
http://www.hdfgroup.org/tools5desc.html

[1] Interface control documents LOFAR-USG-ICD-002 and LOFAR-USG-ICD-005 provide supplemental
specifications for LOFAR naming conventions, and therefore are not included in the listing.

6. Summary and Future Considerations

In order that adoption of HDF5 in astronomy prove useful in the real world, the LOFAR 
project is committing resources to help develop the next generation of astronomical tools for, not 
only LOFAR data, but other SKA pathfinder projects and, more broadly, astronomical radio data 
in general. Though our immediate goal is, of course, to meet LOFAR's many and varied scientific 
requirements, we hope these HDF5-based formats will be more broadly useful to our colleagues in 
the radio community for their data products as well. 

A large part of this effort is the development of the Data Access Library (DAL), which will
ultimately provide interfaces to the FITS, CASA/AIPS++ MeasurementSet and LOFAR HDF5 formats.
A Python interface to the DAL, pyDAL, is pending development of the DAL. All LOFAR products
will be accessible through DAL and pyDAL tools.

Future work involves developing an interface between DS9 and LOFAR HDF5 data files, which
will allow users to open and examine LOFAR data with the de facto standard astronomical image
viewer.

Ultimately we would like to see these formats grow into a true set of standards for radio 
data that can meet the demands of the next generation of radio observatories. Such standards are 
something sorely lacking in the radio community at present and something we will certainly need 
as we move into the SKA era. 

The effort by LOFAR to design an HDF5 radio data standard is driven in great part by
the consideration that there are no effective standards for astronomical radio data. And, as indicated
earlier, the expected data volumes produced by LOFAR will, in many cases, swamp currently
employed file technologies. It is this reality that has led LOFAR to this work, and it is hoped that
other institutes and telescope projects will join this effort toward building a radio data standard,
one essential to a cooperative future for radio astronomy.